-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: with_param_values on LogicalPlan::EmptyRelation and LogicalPlan::Values returns incorrect schema
#18286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me 👍
Do you think it's also possible to add the queries from the original issue as tests?
|
Thanks @Jefffrey I tried to add test for the subquery alias: let df = ctx.sql("SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b)").await?;
let df_with_params_replaced = df.with_param_values(vec![
ScalarValue::UInt32(Some(1)),
ScalarValue::Utf8(Some("foofy".to_string())),
])?;
dbg!(df_with_params_replaced.collect().await?[0].schema());
#> Error: ArrowError(InvalidArgumentError("column types must match schema types, expected Null but found UInt32 at column index 0"), Some(""))But the test still fail, the plan after replacing params (the schema for The expected params: I think it needs more works, convert to draft for now. |
ba80e38 to
ac1c953
Compare
ac1c953 to
599bed8
Compare
6a8297e to
8e038e1
Compare
8e038e1 to
bb76688
Compare
38177db to
d3cc940
Compare
with_param_values on EmptyRelation returns incorrect schemawith_param_values on LogicalPlan::EmptyRelation and LogicalPlan::Values returns incorrect schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some changes to handle the value based relation in the issue.
Summary:
-
Always
recompute_schemaafter replacing placeholder, to ensure parent nodes can see changes in the child node's schema. -
Recompute schema for
LogicalPlan::Values: we are using the old schema forLogicalPlan::Valueshere , the change in values' data type can never be reflected to the schema.
I tried to use slt test but couldn't find a way to EXPLAIN a plan after PREPARE.
|
|
||
| /// Test for https://github.com/apache/datafusion/issues/18102 | ||
| #[tokio::test] | ||
| async fn test_query_parameters_in_values_list_relation() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue mentioned two queries.
This tests the first one: SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b)
The second one is verified in the above test: SELECT $1, $2
| if n_cols == 0 { | ||
| return plan_err!("Values list cannot be zero length"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use LogicalPlanBuilder::values to recompute the schema for LogicalPlan::Values
causing this line to fail the test from #12339.
Since we allow values without columns, I think this line should be removed.
| // `old_field`'s data type is unknown but `new_field`'s is known | ||
| if old_field.data_type().is_null() | ||
| && !new_field.data_type().is_null() | ||
| { | ||
| let field = old_field | ||
| .as_ref() | ||
| .clone() | ||
| .with_data_type(new_field.data_type().clone()); | ||
| (table_ref.cloned(), Arc::new(field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change, the new schema needs new_field's data type.
What about other attributes?
I tried to apply new_field's nullability, but the test in insert.slt:305 failed,
because its schema require NOT NULL when the new_field is nullable.
CREATE TABLE table_without_values(field1 BIGINT NOT NULL, field2 BIGINT NULL);
-- statement error Invalid argument error: Column 'column1' is declared as non-nullable but contains null values
insert into table_without_values values(NULL, 300);So the new schema shouldn't use new_field's nullability.
However, I wasn't quite sure whether other attributes (i.e. metadata) need to be populated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still trying to wrap my head around what's happening here. From what I can tell, there are two things to consider here:
- When recomputing
new_plan, we lose the column names on the old plan (self) so this is a way of keeping that - If we ignore the nullability from
selfand go withnew_plannullability that also runs into other issue? (Though from what I see this may still cause an error, just later in the pipeline/has a different message)
Am I correct in my understanding here?
I wonder for point 1 if we can do this in a nicer way, perhaps by introducing something like LogicalPlanBuilder::values_with_names to preserve the column names from the start instead of amending after the fact.
For point 2 I'm not sure, perhaps an explicit check upfront is better 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking at this
Am I correct in my understanding here?
Correct.
Allow me to further explain why we need self and why we don't need the whole self.
We need self:
- Point 1:
new planuses different name, this is clearly an issue. - Point 2:
new planuses different nullability, this allows
INSERT INTO non_null_table VALUES (NULL)to pass (even though it shouldn't),
because non-nullableselfschema is replaced with nullablenew planschema.
We don't need the whole self:
Let use the sql in this issue SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b) as reference.
After replacing params:
self(t(a,b)) doesn't know aboutaandbdata types.new plan(VALUES ($1, $2)) knows about the data types, but doesn't know aboutaandb.
So, I was trying to apply new data types while keeping everything in self intact.
It was basically just these two lines, but since I couldn't find something similar to
merge_data_type, I ended up with the whole schema migration.
LogicalPlan::Values(Values { schema, values }) => {
+ let new_plan = // compute new plan here
+ schema.merge_data_type(new_plan.schema());
Ok(LogicalPlan::Values(Values { schema, values }))| let schema = DFSchema::new_with_metadata( | ||
| qualified_fields, | ||
| schema.metadata().clone(), | ||
| )? | ||
| .with_functional_dependencies(schema.functional_dependencies().clone())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a new schema here because I didn't know how to modify fields in a schema.
Please tell me if we have APIs for modifying schema's fields (then some clone can be avoided).
| // always recompute the schema to ensure the changed in the schema's field should be | ||
| // poplulated to the plan's parent | ||
| .map_data(|plan| plan.recompute_schema()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to computed even though plan isn't transformed.
Otherwise changes in a child's schema cannot be populated to
its ancestors.
|
|
||
| // replaced | ||
| assert_snapshot!(plan.display_indent_schema(), @r" | ||
| Projection: t.a, t.b, t.c [a:Int32;N, b:Int32;N, c:Int64;N] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the code change, this line would be:
Projection: t.a, t.b, t.c [a:Int32;N, b:Null;N, c:Int64;N]
|
I aim to review this tomorrow (or at least sometime this week) |
Which issue does this PR close?
SubqueryAlias,Values, and/orEmptyRelationhave incorrect schemas after replacingPlaceholdervalues #18102.Rationale for this change
with_param_valuesdoesn't substitute params' type if it is used onEmptyRelation.Thus, causing
SELECT $1, $2to have incorrect schema after substitution.For example: after replacing
$1 = 1, $2 = "s", the schema is[Null, Null], but it shouldbe
[Int64, Utf8].This schema type mismatch is resolved before converting to physical plan by
the
type_coercionrule in theanalyzer.datafusion/datafusion/optimizer/src/analyzer/type_coercion.rs
Line 149 in 8142360
So I'm not quite sure should we fix it in
with_param_values.What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
No.